Huey Kwik
## Parsed with column specification:
## cols(
## cmte_id = col_character(),
## cand_id = col_character(),
## cand_nm = col_character(),
## contbr_nm = col_character(),
## contbr_city = col_character(),
## contbr_st = col_character(),
## contbr_zip = col_integer(),
## contbr_employer = col_character(),
## contbr_occupation = col_character(),
## contb_receipt_amt = col_double(),
## contb_receipt_dt = col_character(),
## receipt_desc = col_character(),
## memo_cd = col_character(),
## memo_text = col_character(),
## form_tp = col_character(),
## file_num = col_integer(),
## tran_id = col_character(),
## election_tp = col_character()
## )
## Joining, by = "zip"
## [1] 1072770 25
## Classes 'tbl_df', 'tbl' and 'data.frame': 1072770 obs. of 25 variables:
## $ cmte_id : chr "C00575795" "C00575795" "C00575795" "C00577130" ...
## $ cand_id : chr "P00003392" "P00003392" "P00003392" "P60007168" ...
## $ cand_nm : Factor w/ 25 levels "Bush, Jeb","Carson, Benjamin S.",..: 4 4 4 20 20 20 20 4 20 20 ...
## $ contbr_nm : chr "AULL, ANNE" "CARROLL, MARYJEAN" "GANDARA, DESIREE" "LEE, ALAN" ...
## $ contbr_city : chr "LARKSPUR" "CAMBRIA" "FONTANA" "CAMARILLO" ...
## $ contbr_st : chr "CA" "CA" "CA" "CA" ...
## $ contbr_zip : int 949391913 934284638 923371507 930111214 902784310 902784310 920842849 926372912 926833846 949522729 ...
## $ contbr_employer : chr "N/A" "N/A" "N/A" "AT&T GOVERNMENT SOLUTIONS" ...
## $ contbr_occupation: chr "RETIRED" "RETIRED" "RETIRED" "SOFTWARE ENGINEER" ...
## $ contb_receipt_amt: num 50 200 5 40 35 100 25 40 10 15 ...
## $ contb_receipt_dt : Date, format: "2016-04-26" "2016-04-20" ...
## $ receipt_desc : chr NA NA NA NA ...
## $ memo_cd : chr "X" "X" "X" NA ...
## $ memo_text : chr "* HILLARY VICTORY FUND" "* HILLARY VICTORY FUND" "* HILLARY VICTORY FUND" "* EARMARKED CONTRIBUTION: SEE BELOW" ...
## $ form_tp : chr "SA18" "SA18" "SA18" "SA17A" ...
## $ file_num : int 1091718 1091718 1091718 1077404 1077404 1077404 1077404 1091718 1077404 1077404 ...
## $ tran_id : chr "C4768722" "C4747242" "C4666603" "VPF7BKWA097" ...
## $ election_tp : Factor w/ 3 levels "G2016","P2016",..: 2 2 2 2 2 2 2 2 2 2 ...
## $ cand_last_name : Factor w/ 25 levels "Bush","Carson",..: 4 4 4 20 20 20 20 4 20 20 ...
## $ party : Ord.factor w/ 5 levels "Democratic"<"Republican"<..: 1 1 1 1 1 1 1 1 1 1 ...
## $ zip : chr "94939" "93428" "92337" "93011" ...
## $ city : chr "Larkspur" "Cambria" "Fontana" "Camarillo" ...
## $ state : chr "CA" "CA" "CA" "CA" ...
## $ latitude : num 37.9 35.6 34 34 33.9 ...
## $ longitude : num -123 -121 -117 -119 -118 ...
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## -10000.0 15.0 27.0 122.7 100.0 10800.0 1
## [1] "Number of negative contributions: 10617"
Observations:
contb_receipt_amt in the dataset.## # A tibble: 8 × 5
## contbr_nm contb_receipt_dt contb_receipt_amt tran_id
## <chr> <date> <dbl> <chr>
## 1 HOROWITZ, DAVID 2015-07-06 10800 SA17A.123258
## 2 HOROWITZ, DAVID 2015-07-06 -5400 SA17A.123258.0
## 3 HOROWITZ, MICHELLE 2015-07-06 5400 SA17A.123258.1
## 4 HOROWITZ, MICHELLE 2015-07-06 -2700 SA17A.123258.2
## 5 HOROWITZ, MICHELLE 2015-07-06 2700 SA17A.123258.3
## 6 HOROWITZ, DAVID 2015-07-06 -2700 SA17A.123258.4
## 7 HOROWITZ, DAVID 2015-07-06 2700 SA17A.123258.5
## 8 HOROWITZ, MICHELLE 2015-07-06 5400 SA17A.123263
## # ... with 1 more variables: election_tp <fctr>
I looked at some examples of contributions that were above $2700 and came across David and Michelle Horowitz. They appear to be a couple who donated to Scott Walker’s campaign.
Summing up contb_receipt_amt, we get $10,800. Is this an instance of people contributing over the limit?
From what I can tell, this instead seems to be double-counting! The FEC provides an Individual Contributor Search, which lets us look at each contributor record in more detail.
From there, I was able to piece this story:
If this story is true, then these donations are within the campaign contribution limits for primary and general elections. From an election integrity standpoint, this is good.
However, when doing analysis of this data, we should be aware of this discrepancy in our analysis. A contribution like Michelle Horowitz’s reattributed $5,400 may be double-counted in our analysis. Also, a large contribution of $10,800 by David Horowitz will count towards calculating the mean, even though it gets reattributed into smaller contributions later.
Democrats had the most contributions by far, which makes sense in California.
## Joining, by = "zip"
## Warning: Removed 10 rows containing missing values (geom_point).
As you can see, there are contributions from zipcodes that are outside of California. However, since all the state fields in the data set are recorded as California, I think these might be cases where the data was entered incorrectly.
Let’s restrict our visualization to known California zipcodes:
## Joining, by = "zip"
It seems like most of the contributions are centered around the major cities in California: Los Angeles, San Francisco, San Diego, and Sacramento.
Above, we look at hte top 10 occupations and employers in our dataset.
There are 1,073,271 records in the dataset with 18 features.
The features are as follows:
Factors: Candidate name, election type (Primary 2016, General 2016, or Primary 2020)
Other observations:
I’m mostly interested in looking at patterns/differences in contributions among different candidates. So for me, the main features of interest are candidate name, contribution amount, and date.
Zipcode, employers, occupation, and party might provide other angles into the data.
I created a Party variable to represent the political party for each canddiate.
I also merged in data from the zipcode dataset: zip, city, state, latitude, and longitude.
When histogramming the contribution amounts, I used a log scale since one of the bins was really large. This made it easier to see the rest of the data.
## contb_receipt_amt
## Min. :-5700.00
## 1st Qu.: 15.00
## Median : 27.00
## Mean : 50.57
## 3rd Qu.: 50.00
## Max. :10000.00
## contb_receipt_amt
## Min. :-5400.0
## 1st Qu.: 15.0
## Median : 25.0
## Mean : 146.3
## 3rd Qu.: 100.0
## Max. : 7300.0
## NA's :1
Since Sanders is often portrayed as the more progressive, blue-collar candidate, it is interesting to see that Clinton’s median donation is actually lower. It is interesting that Clinton and Sanders average donation amounts are roughly the same. Clinton’s median donation is actually lower, i.e. $27 vs. $25. Of course, this data does not include donations to Political Action Committees, so that could be a factor.
For the box plots, I sorted the candidates from highest number of donations to lowest.
In the first box plot, we can see that some candidates actually have many donations above the individual limit of $2700. Many also have negative donations, which could either be refunds or reattributions.
In the second box plot, I excluded negative contributions to see if we could see any other patterns.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Observations:
Last donation date could be a proxy variable for how long a campaign lasts. As we can see, this is positively correlated with the total number of donations.
ggplot(data = contributions.candidates, aes(x = total, y = n)) +
geom_point() +
scale_x_log10() +
scale_y_log10()
primaries <- inner_join(contributions.candidates, votes, by = c("cand_last_name" = "last_name"))
## Warning in inner_join_impl(x, y, by$x, by$y, suffix$x, suffix$y): joining
## factors with different levels, coercing to character vector
ggplot(data = primaries, aes(x = n, y = delegates)) +
geom_point() +
scale_x_log10() +
facet_wrap(~party)
ggplot(data = primaries, aes(x = n, y = votes)) +
geom_point() +
scale_x_log10() +
scale_y_log10() +
facet_wrap(~party)
There seems to be a positive correlation between the log number of votes and the log number of donations.
This visualization again shows concentration of donation activity around cities, but also shows more domination by Democrats.
Over 60% of donations to Bernie Sanders were from people whose occupation was “Not Employed”.
Number of contributions and total contributions are positively correlated.
Both of these features are positively correlated with number of votes and number of delegates in the primary election.
# Group by city to consolidate zipcodes.
contributions.city.party <- contributions.ca %>%
group_by(city, party) %>%
summarise(
total = sum(contb_receipt_amt),
mean = mean(contb_receipt_amt),
median = median(contb_receipt_amt),
latitude = mean(latitude),
longitude = mean(longitude),
n = n())
ca_base <- ggplot() +
geom_polygon(data = ca_df, aes(x = long, y = lat, group = group), fill=NA, color="black") +
coord_fixed(1.3) +
scale_size(range = c(0, 6))
# All five parties at once.
ca_base +
geom_point(data = contributions.city.party, aes(x = longitude, y = latitude, color = party, size = n), alpha = 1/2) +
scale_color_manual(values = party.colors)
ca_base +
geom_point(data = contributions.city.party, aes(x = longitude, y = latitude, size = total, color = party), alpha = 1/2) +
scale_color_manual(values = c("#3333FF", "#EE3523", "#FED105", "#17aa5c", "#DDDDDD"))
## Warning: Removed 1 rows containing missing values (geom_point).
# Just Democratic.
ca_base +
geom_point(data = contributions.city.party %>% filter(party == "Democratic"),
aes(x = longitude, y = latitude, size = n),
color = party.colors[1],
alpha = 1/2)
# Just Republican.
ca_base +
geom_point(data = contributions.city.party %>% filter(party == "Republican"),
aes(x = longitude, y = latitude, size = n),
color = party.colors[2],
alpha = 1/2)
# Number of contributions, facet by party.
ca_base +
geom_point(data = contributions.city.party, aes(x = longitude, y = latitude, size = n, color=party), alpha = 1/2) +
scale_color_manual(values = party.colors) +
facet_wrap(~party)
# Total of contributions, facet by party.
ca_base +
geom_point(data = contributions.city.party, aes(x = longitude, y = latitude, size = total, color=party), alpha = 1/2) +
scale_color_manual(values = party.colors) +
facet_wrap(~party)
## Warning: Removed 1 rows containing missing values (geom_point).
contributions.city.cand <- contributions.ca %>%
filter(election_tp == "P2016") %>%
group_by(city, cand_last_name) %>%
summarise(
total = sum(contb_receipt_amt),
mean = mean(contb_receipt_amt),
median = median(contb_receipt_amt),
latitude = mean(latitude),
longitude = mean(longitude),
n = n())
dems = c("Clinton", "Sanders")
ca_base +
geom_point(data = contributions.city.cand %>% filter(cand_last_name %in% dems), aes(x = longitude, y = latitude, size = total, color = cand_last_name))
ca_base +
geom_point(data = contributions.city.cand %>% filter(cand_last_name %in% dems), aes(x = longitude, y = latitude, size = total, color = cand_last_name)) +
facet_wrap(~cand_last_name)
The Democratic and Republican parties receive donations from similar areas, with the Democrats receiving more donations from more largely populated areas.
I expected to see some difference between Clinton and Sanders support geographically.